
Jing-Xuan Zhang

Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization

Jan 30, 2026

Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition

Oct 27, 2025

Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models

Feb 09, 2025

Target Speaker Lipreading by Audio-Visual Self-Distillation Pretraining and Speaker Adaptation

Feb 09, 2025

Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

Dec 06, 2022

Is Lip Region-of-Interest Sufficient for Lipreading?

Jun 02, 2022

TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

Nov 19, 2020

Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis

Jul 18, 2018